 reddit data


Navigating the Post-API Dilemma: Search Engine Results Pages Present a Biased View of Social Media Data

arXiv.org Artificial Intelligence

Recent decisions to discontinue access to social media APIs are having detrimental effects on Internet research and the field of computational social science as a whole. This lack of access to data has been dubbed the Post-API era of Internet research. Fortunately, popular search engines have the means to crawl, capture, and surface social media data on their Search Engine Results Pages (SERP) when given a suitable search query, and may provide a solution to this dilemma. In the present work we ask: does SERP provide a complete and unbiased sample of social media data? Is SERP a viable alternative to direct API access? To answer these questions, we perform a comparative analysis between Google SERP results and nonsampled data from Reddit and Twitter/X. We find that SERP results are highly biased in favor of popular posts; against political, pornographic, and vulgar posts; are more positive in their sentiment; and have large topical gaps. Overall, we conclude that SERP is not a viable alternative to social media API access.


Has Sentiment Returned to the Pre-pandemic Level? A Sentiment Analysis Using U.S. College Subreddit Data from 2019 to 2022

arXiv.org Artificial Intelligence

As the impact of the COVID-19 pandemic winds down, both individuals and society are gradually returning to pre-pandemic activities. This study aims to explore how people's emotions have changed from the pre-pandemic period, through the pandemic, to the post-emergency period, and whether they have returned to pre-pandemic levels. We collected Reddit data from 2019 (pre-pandemic), 2020 (peak pandemic), 2021, and 2022 (late stages of the pandemic, the transition to the post-emergency period) from the subreddits of 128 universities/colleges in the U.S., along with a set of school-level characteristics. We predicted two sets of sentiments, one from a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) and one from a graph attention network (GAT) that leverages both rich semantic and relational information among posted messages, and then applied a logistic stacking method to obtain the final sentiment classification. After obtaining a sentiment label for each message, we used a generalized linear mixed-effects model to estimate the temporal trend in sentiment from 2019 to 2022 and how school-level factors may affect sentiment. Compared to the year 2019, the odds of negative sentiment in years 2020, 2021, and 2022 are 24%, 4.3%, and 10.3% higher, respectively, which are all statistically significant (adjusted $p$<0.05). Our study findings suggest a partial recovery in the sentiment composition in the post-pandemic-emergency era. The results align with common expectations and provide a detailed quantification of how sentiments have evolved from 2019 to 2022.
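The effect sizes above are stated as changes in odds, which are not the same as changes in probability. As a minimal sketch, the snippet below converts the reported 24%, 4.3%, and 10.3% increases in odds into probabilities of negative sentiment. The abstract does not give the 2019 baseline, so the 30% baseline used here is purely a hypothetical assumption, as are the function and variable names.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Scale the odds of baseline_prob by odds_ratio and return the new probability."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical 2019 share of negative messages (an assumption, NOT from the study).
ASSUMED_BASELINE = 0.30

# Odds ratios implied by the abstract: odds 24%, 4.3%, and 10.3% higher than in 2019.
REPORTED_ODDS_RATIOS = {2020: 1.24, 2021: 1.043, 2022: 1.103}

# Probability of negative sentiment per year under the assumed baseline.
implied = {
    year: apply_odds_ratio(ASSUMED_BASELINE, ratio)
    for year, ratio in REPORTED_ODDS_RATIOS.items()
}

for year, prob in implied.items():
    print(f"{year}: {prob:.3f}")
```

Under this assumed baseline, 24% higher odds corresponds to a rise of under five percentage points in the probability of negative sentiment; a different baseline would give a different shift for the same odds ratio.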


Parameter-Efficient Legal Domain Adaptation

arXiv.org Artificial Intelligence

Seeking legal advice is often expensive. Recent advancements in machine learning for solving complex problems can be leveraged to help make legal services more accessible to the public. However, real-life applications encounter significant challenges. State-of-the-art language models are growing increasingly large, making parameter-efficient learning increasingly important. Unfortunately, parameter-efficient methods perform poorly with small amounts of data, which are common in the legal domain (where data labelling costs are high). To address these challenges, we propose parameter-efficient legal domain adaptation, which uses vast unsupervised legal data from public legal forums to perform legal pre-training. This method exceeds or matches the few-shot performance of existing models such as LEGAL-BERT on various legal tasks while tuning only approximately 0.1% of model parameters. Additionally, we show that our method can achieve calibration comparable to existing methods across several tasks. To the best of our knowledge, this work is among the first to explore parameter-efficient methods of tuning language models in the legal domain.


An AI Twitter bot that only tweets good news, with Python and GPT2

#artificialintelligence

Running AI these days is increasingly simple, thanks to the hard work of open source contributors producing top-notch libraries and of research groups opening up their work so others can build on it. One key library doing that is HuggingFace's Transformers library. HuggingFace are a startup building, amongst other NLP-related products, a library and model ecosystem that allows almost anyone to quickly and easily set up AI-powered chat bots that can consume or produce natural language. In this post, I'll demonstrate how I used this library to produce a Twitter bot that tweets only made-up (and slightly quirky) good news. This blog post isn't meant to explain any theory, but for those who aren't familiar, the easiest way to explain this kind of AI is that these models are sophisticated pattern recognition systems. If you feed one enough data, it can build up an ability to recognize the patterns in the English language, to the extent that if you ask it to repeat the pattern, not only will it generate mostly correct English grammar, it might also from time to time generate a coherent sentence!


Social media data reveals signal for public consumer perceptions

arXiv.org Artificial Intelligence

Researchers have used social media data to estimate various macroeconomic indicators about public behaviors, mostly as a way to reduce surveying costs. One of the most widely cited economic indicators is the consumer confidence index (CCI). Numerous studies in the past have focused on using social media, especially Twitter data, to predict CCI. However, according to a recent comprehensive survey, the strong correlations disappeared when those models were tested with newer data. In this work, we revisit the problem of assessing the true potential of using social media data to measure CCI, by proposing a robust non-parametric Bayesian modeling framework grounded in Gaussian Process Regression (which provides both an estimate and an uncertainty associated with it). Integral to our framework is a principled experimentation methodology that demonstrates how digital data can be employed to reduce the frequency of surveys, so that periodic polling would be needed only to calibrate our model. Via extensive experimentation we show how micro-decisions, such as the smoothing interval and the various types of lags, have an important bearing on the results. By using decadal data (2008-2019) from Reddit, we show that both monthly and daily values of CCI can, indeed, be reliably estimated at least several months in advance, and that our model estimates are far superior to those generated by the existing methods.
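To illustrate what "both an estimate and an uncertainty" means for Gaussian Process Regression, here is a minimal pure-Python sketch of the standard GP posterior at a single test input. It is not the authors' framework: the zero-mean prior, the squared-exponential kernel, the fixed hyperparameters, and the toy data are all assumptions chosen for illustration.

```python
import math


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def solve_lower(lower, rhs):
    """Forward substitution: solve lower @ x = rhs."""
    x = []
    for i, row in enumerate(lower):
        x.append((rhs[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def solve_lower_transposed(lower, rhs):
    """Back substitution: solve transpose(lower) @ x = rhs."""
    n = len(lower)
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(lower[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (rhs[i] - partial) / lower[i][i]
    return x


def gp_predict(train_x, train_y, test_x, noise=1e-2, length_scale=1.0, variance=1.0):
    """Return (mean, std) of a zero-mean GP posterior at one test input."""
    n = len(train_x)
    # Training covariance with observation noise added on the diagonal.
    k_train = [
        [
            rbf_kernel(train_x[i], train_x[j], length_scale, variance)
            + (noise if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    lower = cholesky(k_train)
    alpha = solve_lower_transposed(lower, solve_lower(lower, train_y))
    k_star = [rbf_kernel(x, test_x, length_scale, variance) for x in train_x]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(lower, k_star)
    var = rbf_kernel(test_x, test_x, length_scale, variance) - sum(t * t for t in v)
    return mean, math.sqrt(max(var, 0.0))


# Toy data (hypothetical): time index -> indicator value.
train_x = [0.0, 1.0, 2.0]
train_y = [1.0, 2.0, 0.5]

near_mean, near_std = gp_predict(train_x, train_y, 1.0, noise=1e-6)
far_mean, far_std = gp_predict(train_x, train_y, 100.0, noise=1e-6)
```

Near an observed point the posterior mean tracks the data and the standard deviation shrinks; far from any observation the prediction falls back to the prior mean with the full prior standard deviation.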


Open Domain Dialogue Generation with Latent Images

arXiv.org Artificial Intelligence

We consider grounding open domain dialogues with images. Existing work assumes that both an image and a textual context are available, but image-grounded dialogues by nature are more difficult to obtain than textual dialogues. Thus, we propose learning a response generation model with both image-grounded dialogues and textual dialogues by assuming that there is a latent variable in a textual dialogue that represents the image, and trying to recover the latent image through text-to-image generation techniques. The likelihood of the two types of dialogues is then formulated by a response generator and an image reconstructor that are learned within a conditional variational auto-encoding framework. Empirical studies are conducted in both image-grounded conversation and text-based conversation. In the first scenario, image-grounded dialogues, especially under a low-resource setting, can be effectively augmented by textual dialogues with latent images; while in the second scenario, latent images can enrich the content of responses and at the same time keep them relevant to contexts.


microsoft/DialoGPT

#artificialintelligence

This repository contains the source code and trained model for a large-scale pretrained dialogue response generation model. The human evaluation results indicate that the responses generated by DialoGPT are comparable to human response quality under a single-turn conversation Turing test. The repository is based on huggingface pytorch-transformer and OpenAI GPT-2, and contains a data extraction script, model training code, and pretrained small (117M), medium (345M), and large (762M) model checkpoints. The model is trained on 147M multi-turn dialogues from Reddit discussion threads. The largest model can be trained in several hours on an 8-V100 machine (however, this is not required), with distributed training and the FP16 option.


Reddit Data Led an A.I. Named Norman to Obsess Over Murder, MIT Finds

#artificialintelligence

The advancement of artificial intelligence is concerning to many, from those who fear the growing data collection of A.I. assistants to Elon Musk and his fears that "super-intelligence" will bring humanity's end. How can scientists prevent an A.I. from destroying human civilization? To answer this question, scientists might need to study an A.I. gone bad. Such was the impetus behind Norman, the robot considered to be the "world's first psychopathic A.I." A team of scientists from Scalable Cooperation at the MIT Media Lab, led by Pinar Yanardag, Manuel Cebrian, and Iyad Rahwan, fed the A.I. biased data to see how it might influence its behavior. In April, the team began to expose Norman to potentially damaging biases and bad data to later see how the image-captioning robot "sees" pictures.


Inductive Representation Learning on Large Graphs

arXiv.org Machine Learning

Low-dimensional embeddings of nodes in large graphs have proved extremely useful in a variety of prediction tasks, from content recommendation to identifying protein functions. However, most existing approaches require that all nodes in the graph are present during training of the embeddings; these previous approaches are inherently transductive and do not naturally generalize to unseen nodes. Here we present GraphSAGE, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, we learn a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Our algorithm outperforms strong baselines on three inductive node-classification benchmarks: we classify the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and we show that our algorithm generalizes to completely unseen graphs using a multi-graph dataset of protein-protein interactions.
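The sample-and-aggregate idea can be sketched in a few lines of Python. This is a simplified illustration, not the GraphSAGE implementation: it shows a single layer with a mean aggregator, and it omits the learned weight matrices, the nonlinearity, and the normalization that a trained model would apply. The toy graph, the feature vectors, and the function names are hypothetical.

```python
import random


def sample_neighbors(graph, node, num_samples, rng):
    """Draw a fixed-size neighbor sample; sample with replacement if too few."""
    neighbors = graph[node]
    if not neighbors:
        return []
    if len(neighbors) >= num_samples:
        return rng.sample(neighbors, num_samples)
    return [rng.choice(neighbors) for _ in range(num_samples)]


def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors (zeros if there are none)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def aggregate(graph, features, node, num_samples=2, rng=None):
    """Concatenate a node's own features with the mean of sampled neighbor features."""
    rng = rng or random.Random(0)
    dim = len(features[node])
    sampled = sample_neighbors(graph, node, num_samples, rng)
    neighborhood = mean_vector([features[n] for n in sampled], dim)
    return features[node] + neighborhood


# Toy graph as adjacency lists, with 2-dimensional node features (hypothetical).
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
features = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [0.0, 4.0], "d": [5.0, 5.0]}

embedding_a = aggregate(graph, features, "a")

# A node added after the fact needs no retraining: the same function applies.
graph["e"] = ["b"]
features["e"] = [7.0, 7.0]
embedding_e = aggregate(graph, features, "e")
```

Because the embedding is computed by a function of a node's features and its neighborhood, rather than looked up from a per-node table, it can be evaluated on nodes that were absent when the function was defined, which is the inductive property the abstract describes.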